Introduction:

As part of our course Data Science with R, we are working on a business problem in order to enhance our R programming skills, explore various libraries, and improve our visualization techniques. We are working on the IWD dataset, which comprises customers' responses on a Likert scale to over 35 questions divided into 10 blocks.

Customer Segmentation with Respect to Customer Satisfaction:

 

Dataset Features Description:

From the dataset we singled out Block-6, which contains individual satisfaction with product groups, and Block-7, which contains individual satisfaction with branches; these relate to customer satisfaction with the individual product groups and with the services provided by the supermarket, respectively.

 

Detailed discussion about Block-6:

In this section we will analyse the 17 questions (80 features) present in Block-6 using two approaches:

 

1. Considering all the features:

Approach:

We planned to use the k-means algorithm on these questions. To obtain the optimum number of clusters we applied the elbow method to the 16-feature dataset, and k = 2 emerged as the optimum, as seen in the graph below.

We performed k-means with k = 2 and plotted a bar plot showing the segmentation of the customers into homogeneous groups: the x-axis shows the features and the y-axis the cluster results.
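The elbow-and-cluster step described above can be sketched in base R. This is a minimal illustration on simulated Likert responses, not the actual Block-6 data; the column count and k follow the text.

```r
# Minimal sketch of the elbow method + k-means step (simulated Likert data
# stands in for the Block-6 features; 16 columns as in the text).
set.seed(42)
block6 <- as.data.frame(matrix(sample(1:5, 200 * 16, replace = TRUE), ncol = 16))

# total within-cluster sum of squares for k = 1..10 -> elbow curve
wss <- sapply(1:10, function(k) kmeans(block6, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

# final model with the k read off the elbow (k = 2 in our analysis)
fit <- kmeans(block6, centers = 2, nstart = 25)
table(fit$cluster)  # cluster sizes
```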

 

 

##     Overall          Snacks         Regional        Private     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.833   1st Qu.:3.989   1st Qu.:4.000   1st Qu.:3.817  
##  Median :4.167   Median :3.989   Median :4.023   Median :3.817  
##  Mean   :4.109   Mean   :3.989   Mean   :4.023   Mean   :3.817  
##  3rd Qu.:4.667   3rd Qu.:3.989   3rd Qu.:4.333   3rd Qu.:3.817  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      Vegan          Branded       FruitVeggies    BreadnBaked   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000   1st Qu.:3.833   1st Qu.:4.000  
##  Median :4.285   Median :4.107   Median :4.141   Median :4.165  
##  Mean   :4.285   Mean   :4.107   Mean   :4.141   Mean   :4.165  
##  3rd Qu.:5.000   3rd Qu.:4.667   3rd Qu.:4.667   3rd Qu.:4.500  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      Flesh          Sausages         Dairy           Sweets     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000  
##  Median :4.193   Median :4.260   Median :4.333   Median :4.301  
##  Mean   :4.193   Mean   :4.260   Mean   :4.326   Mean   :4.301  
##  3rd Qu.:4.667   3rd Qu.:4.833   3rd Qu.:5.000   3rd Qu.:4.833  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##    Alcoholic       SoftDrinks      Cosmetics        Utensils    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.194   1st Qu.:4.000   1st Qu.:4.121   1st Qu.:3.969  
##  Median :4.194   Median :4.225   Median :4.121   Median :3.969  
##  Mean   :4.194   Mean   :4.225   Mean   :4.121   Mean   :3.969  
##  3rd Qu.:4.194   3rd Qu.:4.800   3rd Qu.:4.121   3rd Qu.:3.969  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000

 

2. Filtering the features:

From the above plot we can see that the maximum non-NA count is 2303 and the minimum is 468. The questions 'AlcoholicDrinks', 'Cosmetics', 'Organic', 'Snacks', 'Utensils', and 'Vegan' have a significantly lower count of non-missing values. We eliminate the groups with fewer than 1152 data points, because in these question groups more than 50% of the rows are NAs. To understand the influence of these missing values, we removed the columns with a large number of missing values and repeated the above analysis. Note that in all the remaining columns we fill the NAs with the mean value of the respective column. This leaves us with 11 features (questions) in the dataset.
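A sketch of this filtering-and-imputation step on a toy data frame (in the report the threshold, 1152, is half of the 2303 rows):

```r
# Drop question columns where more than half of the responses are NA,
# then mean-impute the NAs that remain (toy data frame for illustration).
df <- data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, NA, 5), c = c(2, 2, NA, 4))

keep <- colSums(!is.na(df)) >= nrow(df) / 2   # keep columns with >= 50% non-NA
df <- df[, keep, drop = FALSE]                # column b is dropped here

# replace remaining NAs with the column mean
df[] <- lapply(df, function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x })
```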

We performed k-means with k = 2 and plotted a bar plot showing the segmentation of the customers into homogeneous groups: the x-axis shows the features and the y-axis the cluster results.

 

 

Customer Segmentation using Demographic data

From the graph, we can see that a high number of customers are between the ages of 46 and 65, and very few customers are younger than 25 or older than 75.

From the barplot, we can see that the number of female customers is higher than the number of male customers, but the difference is not significant.

From the barplot, we can see that many customers have an income of less than 1,250 euros, and there is a downward trend: as income goes up, the customer count goes down. Many customers are hesitant to disclose their income; in such cases the field is filled with the value 99. This value is not useful for our calculations. We made the assumption that people in the same age group have a similar income, so for customers with the value 99 we filled in the mean income of their age group.
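This imputation can be sketched with base R's `ave()`; the data frame and column names below are illustrative stand-ins, not the real ones:

```r
# Replace the "prefer not to say" code 99 by the mean income of the
# customer's age group (toy stand-in for the demographic data).
cust <- data.frame(age_group = c(1, 1, 1, 2, 2),
                   income    = c(2, 4, 99, 3, 99))

cust$income[cust$income == 99] <- NA
cust$income <- ave(cust$income, cust$age_group,
                   FUN = function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x })
cust$income  # 2 4 3 3 3
```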

From the graph, it is clear that almost 90 percent of the customers have no children and very few customers have 1 or 2 children.

From the graph, we can see that most of the customers have 1 or 2 people in the household, and the trend goes down as the number of people increases.

From the graph, we can see that the highest number of customers are from the state of Nordrhein-Westfalen (around 500) and the fewest from Bremen (around 10).

We have performed k-means clustering on our demographic data. We determined the optimal number of clusters using the elbow method and the silhouette coefficient. From the silhouette graph, we see that the optimal number of clusters can be either 2 or 3, and from the elbow method we observed the bend at 4. But with the elbow method we can choose a number of clusters close to the bend, so we consider 3 the optimal number of clusters.
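The silhouette side of this model selection can be sketched as below, computing the average silhouette width for a range of k (the cluster package is a recommended package bundled with R; the toy matrix stands in for the demographic features):

```r
library(cluster)  # for silhouette()
set.seed(7)
demo <- matrix(rnorm(150 * 4), ncol = 4)  # toy numeric demographic matrix
d <- dist(demo)

# average silhouette width for each candidate k
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(demo, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
best_k <- (2:6)[which.max(avg_sil)]  # compared against the elbow curve
```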

We have visualized our clustering result using popular dimensionality reduction techniques, PCA and t-SNE. From the PCA graph, we can see that the clusters are well separated. From the t-SNE graph we see some overlap between the clusters, but we can still identify the groups easily.
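A sketch of the PCA projection used for this visualization (t-SNE would be the analogous call via `Rtsne::Rtsne`, assuming that package is installed; toy data below):

```r
set.seed(7)
demo <- matrix(rnorm(150 * 4), ncol = 4)              # toy demographic matrix
cl <- kmeans(demo, centers = 3, nstart = 10)$cluster  # cluster labels

pc <- prcomp(demo, scale. = TRUE)                     # principal components
plot(pc$x[, 1:2], col = cl, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Clusters in PCA space")
# t-SNE alternative (needs the Rtsne package):
# tsne <- Rtsne::Rtsne(demo); plot(tsne$Y, col = cl, pch = 19)
```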

Summary of the clustering results

Groups   Avg_Age   Avg_Income   Avg_children   Avg_people   count
group1         5            5              1            2     706
group2         5            5              1            2    1047
group3         5           13              2            3     550

The above graphs do not convey much about the behavior of the clusters, but the above table illustrates the similarity between customers. We calculated the mean of the numeric variables (age, income, number of children, and number of people); we cannot calculate a mean for categorical data such as state and gender. From the table we observe that clusters 1 and 2 show no difference, but by plotting income against state we drew some new insights.

  1. Customer group1 has low to average income and belongs to states 1 to 7 (Baden-Württemberg to Hessen).

  2. Customer group2 has low to average income and belongs to states 8 to 16 (Mecklenburg-Vorpommern to Thüringen).

  3. Customer group3 consists of high-income people from all the states.

When decision-makers want to implement a particular strategy, they will focus on a particular subgroup rather than on all customers in general, because needs differ between customers. So it is a good strategy to group customers based on some common aspects and implement strategies to improve their satisfaction.


Customer Satisfaction Index

Background

This section focuses on the development of a CSI (Customer Satisfaction Index). It evaluates and assesses the dataset from an angle that helps in the comparative analysis of the surveyed superstores and products.

As the questionnaire has several sections, the sections directly linked to CSI development are sections 11 to 26. The criteria that each section focuses on are:

  1. Range and diversity
  2. Price and performance ratio
  3. Availability of the product

Moreover, some later sections consider three more criteria to evaluate the superstores and products:

  1. Quality
  2. Freshness
  3. Presentation

Each section addresses a unique category of products that are available and accessible across all the superstores considered in this questionnaire. The details regarding the sections and the product categories are as follows:

Next comes the scale or measurement criterion. The scale employed to gauge each question in the above sections is a Likert scale, the most commonly and widely used psychometric scale in questionnaires. On a Likert scale, all possible options are grouped and normalized on a scale that can be of any length; the most commonly used Likert scales contain either 5 or 7 points.

In this questionnaire, a 5-point scale is used for the above-mentioned questions, as follows:

  1. Very dissatisfied
  2. Dissatisfied
  3. Neutral
  4. Satisfied
  5. Very satisfied
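These labels map directly onto the numeric codes 1-5; a small sketch of recoding the text labels to scores:

```r
# Map Likert labels to their 1-5 numeric codes via factor levels.
likert_levels <- c("Very dissatisfied", "Dissatisfied", "Neutral",
                   "Satisfied", "Very satisfied")
responses <- c("Satisfied", "Neutral", "Very satisfied")
scores <- as.integer(factor(responses, levels = likert_levels))
scores  # 4 3 5
```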

Preliminary Analysis

The previous section discussed the questions and details necessary for the development of the CSI and for measuring customers' satisfaction. This section highlights the approaches and steps considered in the early analysis of the data.

Initially, without performing any preprocessing (dimensionality reduction, for instance), we plotted all the superstores and products. Fig. 1 illustrates the overall satisfaction level of the stores across every product. As mentioned in the previous section, each product is evaluated using either 3 or 6 questions; therefore, this initial plot takes the average of all the questions for each product (or section) and plots them against the stores. The horizontal axis represents products, the vertical axis shows the satisfaction level, and the lines represent the stores. In this plot it is evident at first glance that the majority of the superstores share a similar level of satisfaction. For instance, for the product Alkohol, all stores except VMarkt, Combi, and Familia fall approximately in the range 4.0-4.25. Similarly, for the other products the trend suggests that all the stores, except a few, lie near each other, forming a chain. Nevertheless, a few stores stand out in the graph. For instance, following the trend for VMarkt, it has the lowest satisfaction for 5 out of 15 products (Alkohol, Alkoholfrei, BB, Og, and Suess). Also, the graph suggests that the store Andere has the lowest satisfaction level for the product Gegenstaende. This plot is helpful for identifying which stores have high and low satisfaction scores for each product, but it is quite cumbersome, and it is difficult to decode any hidden information from it.

Fig. 1
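The aggregation behind Fig. 1 (averaging each product's 3-6 question scores per store) can be sketched with `aggregate()` on toy long-format data; the store/product names are illustrative:

```r
# Average the per-question scores within each store/product pair.
long <- data.frame(store   = rep(c("Real", "VMarkt"), each = 4),
                   product = rep(c("Alkohol", "Vegan"), times = 4),
                   score   = c(4, 5, 3, 4, 2, 3, 2, 2))
avg <- aggregate(score ~ store + product, data = long, FUN = mean)
```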

We then tried to look at the data from a different angle. Figure 2 uses R's grid feature in that the stores are grouped in a way that helps identify the store the customers are most satisfied with. The horizontal axis represents products and the vertical axis the satisfaction scale. The important aspect of this plot is the use of gradient color to distinguish different satisfaction levels; due to this, the plot at first glance gives the impression of a heatmap. Because of the gradient colors it is also easy to identify any store or product with a low or high satisfaction level compared to the others. For instance, the store Andere has the lowest satisfaction score, 3.48, for the product Gegenstaende. Also, when comparing the store Real with VMarkt, one can easily deduce that customers are more satisfied with Real than with VMarkt. Although this plot helps in distinguishing the stores, it still does not separate the stores with high review counts from those with fewer reviews.

Fig. 2

Dimensionality Reduction

In the last section we were able to identify some patterns, but differentiating the stores that share the same trend was hard. This section focuses on identifying the hidden patterns in the data by reducing the dimensions so that the data can be analyzed more accurately and precisely. To get rid of data that contributes little and is not informative enough for the analysis, we employed a dimensionality reduction approach, namely PCA (Principal Component Analysis). This section performs PCA in two ways. In the first approach the products are the principal components and the stores are plotted across them; in the second approach the roles are swapped, with the stores as principal components and the products as the data points.

In Fig. 3 the principal components refer to the products, which is why 15 principal components are mentioned. In the graph, the vectors with red color and captions represent the stores, and the labels in black represent the products. This graph gives some important insights into the relation between stores and products. For example, VMarkt, Nahfrish, and Feneberg are correlated, as they all grow in a similar direction. Similarly, JibiMarkt and Klasskock also form a group, and the rest, except Tengu, MixMarkt, and Hit, grow in almost the same direction and share more information with each other.

## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     4.8644 1.73760 1.15834 0.99171 0.97917 0.76054 0.6120
## Proportion of Variance 0.7395 0.09435 0.04193 0.03073 0.02996 0.01808 0.0117
## Cumulative Proportion  0.7395 0.83381 0.87574 0.90647 0.93644 0.95451 0.9662
##                            PC8    PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.55549 0.4898 0.44705 0.36101 0.32815 0.26619 0.15466
## Proportion of Variance 0.00964 0.0075 0.00625 0.00407 0.00337 0.00221 0.00075
## Cumulative Proportion  0.97586 0.9833 0.98960 0.99367 0.99704 0.99925 1.00000
##                             PC15
## Standard deviation     9.594e-15
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00
Fig. 3

Similarly, we plotted the PCA for stores and products in Fig. 4, this time making the stores the principal components. This plot also helps in finding out which stores share information and can be helpful in comparative analysis. For example, Snacks, Gegenstaende, and Vegan are not correlated with the rest of the products. Similarly, looking at the stores, it is evident that the stores in the 2nd quadrant can be compared with each other rather than with stores like VMarkt and Coop.

## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     3.1287 1.5034 0.92045 0.79499 0.6124 0.56308 0.47990
## Proportion of Variance 0.6526 0.1507 0.05648 0.04213 0.0250 0.02114 0.01535
## Cumulative Proportion  0.6526 0.8033 0.85975 0.90188 0.9269 0.94802 0.96337
##                            PC8     PC9   PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.35962 0.34352 0.3216 0.27778 0.23721 0.17512 0.15564
## Proportion of Variance 0.00862 0.00787 0.0069 0.00514 0.00375 0.00204 0.00161
## Cumulative Proportion  0.97200 0.97986 0.9868 0.99190 0.99565 0.99770 0.99931
##                           PC15
## Standard deviation     0.10153
## Proportion of Variance 0.00069
## Cumulative Proportion  1.00000
Fig. 4

The PCA plots above provide a decent overview of the stores and products that are interconnected and would produce more insight when plotted together than with those that load more on a different principal component. Therefore, to dig further into the selection of which stores should be clubbed together, we tried the following additional plots.

In this regard, Fig. 5 plots the stores and their review counts. As is evident from the plot, the data is quite imbalanced in terms of reviews: some stores have far more reviews than others, and there is a significant difference between the stores with many reviews and those with few.

We decided to use a threshold value to split the stores into two sections. The first group would contain the stores with a significant number of reviews, and the second group the stores with fewer reviews. But the problem is how to choose the cut-off value. We again referred to the PCA plot where the stores are the principal components. The clusters there suggest that the stores with a review count of less than 200 can be grouped together and the rest moved to the first group.
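The split itself is a one-liner once the counts are available (toy counts; store names illustrative):

```r
# Split stores by the chosen review-count threshold of 200.
review_counts <- c(Real = 1400, Edeka = 2100, VMarkt = 150, Combi = 60)
group1 <- names(review_counts)[review_counts >= 200]  # many reviews
group2 <- names(review_counts)[review_counts < 200]   # few reviews
```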

Fig. 5

Similarly, since the above plot helped in splitting the stores, we repeated the same process for the products. With the help of this plot, Fig. 6, and the PCA plot, Fig. 3, the cut-off value we arrived at is 1500: products with a review count of less than 1500 belong to the second group and the rest to the first group.

Fig. 6

Detailed Analysis

This section concentrates on the detailed analysis of the stores and products. In the last section we split the stores and products into four groups, as follows:

The first two plots, Fig. 7 and Fig. 8, plot the stores with a review count above the threshold against both product groups. These plots use R's facet_wrap functionality to exploit data patterns and display the data in a more scattered/grouped form. Each sub-grid represents the satisfaction level of one store against all the available product categories. The horizontal axis represents product categories, and the vertical axis shows the satisfaction level for the respective product category in each sub-grid. These plots also use different colors for different satisfaction levels, which makes it comparatively easy to distinguish the stores that are popular and have better customer reviews. In the first plot, Fig. 7, where all the stores have a review count above 200, the stores follow a similar pattern. Customers' reviews for Molke were high across every store, and for Regional they were low, except for Real, for which customers recorded better reviews than for the rest. Furthermore, looking at the steepness of the lines, the stores Real and Norma have comparatively less steep trends than the rest.

In Fig. 8, where the same stores are plotted against the products with a review count of less than 1500, the trend is similar. All the stores have a better satisfaction level for Alkohol and a relatively negative one for Vegan. Contrary to the last plot, Amazon has a better satisfaction level than the others, except Real, with which customers seem to be even more satisfied.

Fig. 7
Fig. 8
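The faceting used in Fig. 7 and Fig. 8 can be sketched with ggplot2's facet_wrap (toy data below; the real plots facet the mean satisfaction per store over the product categories):

```r
library(ggplot2)
set.seed(3)
# toy store x product grid with random satisfaction scores
df <- expand.grid(store   = c("Real", "Norma", "Edeka"),
                  product = c("Molke", "Regional", "Suess"))
df$satisfaction <- runif(nrow(df), 3, 5)

# one panel per store, products on x, colour encodes the satisfaction level
p <- ggplot(df, aes(product, satisfaction, colour = satisfaction)) +
  geom_point(size = 3) +
  facet_wrap(~ store)
```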

Now let us plot the store and product data for Group 2 and Group 4. Fig. 9 plots the stores with a review count of less than 200 against the products with a review count of more than 1500. At first glance it is evident that some stores have higher satisfaction levels than others. The stores Alnatura, Bofrost, Familia, Globus, and Nettoschwarz have satisfaction levels above 4.0 for all the mentioned products. On the downside, the store VMarkt has the lowest satisfaction level, as it has no points marked in green; VMarkt is followed by Coop and MixMarkt, respectively. The important point that makes this plot different from the plots in the last section, where each product shared almost the same satisfaction level, is that in this plot one can easily distinguish the stores by their satisfaction levels.

Fig. 10 plots the data for Group 4. This plot also follows the same trend as the last one: some stores have better satisfaction levels. For instance, the stores Nasfrish and Alnatura have better satisfaction levels for every product, while the stores with bad reviews are Andere and MixMarkt. As in the previous plot, a store can be recommended for the whole product list; for example, Alnatura can be suggested if a customer wants to buy the mentioned products.

Fig. 9

Fig. 10

The last section focused on the stores and helped identify the stores with better satisfaction levels. This section considers the same groups but concentrates on a detailed analysis of the data from the products' perspective. We consider the same four groups in this section.

The first two plots, Fig. 11 and Fig. 12, plot the products with a review count above the threshold against both store groups. These plots again use R's facet_wrap functionality to exploit data patterns and display the data in a more scattered/grouped form. Each sub-grid represents the satisfaction level of one product against all the available stores. The horizontal axis represents stores, and the vertical axis shows the satisfaction level for the respective stores in each sub-grid. These plots also use different colors for different satisfaction levels, which makes it comparatively easy to distinguish the products that are popular and have better customer reviews. In the first plot, where all the products have a review count above 1500, the product Molke is the most popular, followed by Suess, Eigenmarke, Alkoholfrei, and Wurst. The product for which customers provided relatively bad reviews is Regional, followed by Marken and Og.

In the second plot, Fig. 12, where the customers recorded fewer reviews for the products, Alkohol has the best reviews and Vegan comparatively the worst remarks.

Fig. 11
Fig. 12

Similarly, the same approach is used for the stores with a review count below the agreed threshold. In the plots Fig. 13 and Fig. 14, the stores for which customers recorded fewer reviews are plotted against all the products. In data science it is hard to find patterns when there is not enough data, and that applies to these two plots. In the first plot, Fig. 13, all the products share almost the same trend; although the product Molke is better for the first couple of stores, overall the plot does not give much insight. Fig. 14 rather follows a pattern: as it passes through the products, the satisfaction level decreases, but there is still no clear pattern.

Fig. 13
Fig. 14

Finally, we tried to perform exploratory analysis from a different angle. As discussed, each product has been evaluated using either 3 or 6 different questions. These questions belong to the following sections:

  1. Diversity or range of products
  2. Freshness of products
  3. Goods availability
  4. Presentation of goods
  5. Price and performance
  6. Quality of products

In Fig. 15 we incorporated the above categories and performed the analysis in this direction. The horizontal axis represents the stores, and the vertical axis shows the satisfaction level for each store against the question categories mentioned above.

Fig. 15

The underlying motivation for this analysis is to find out which category customers give more importance to. As is evident from the plot, no customers have recorded reviews for the questions belonging to "Diversity & Range", "Goods Availability", and "Price & Performance".


Recommendation Potential

For the recommendation task, we treated this as a multi-label classification problem. From the previous analysis, we selected the best features, which help us understand the customer mindset.

Pipeline

  1. Column selection and feature data frame extraction
  2. Feature extraction
  3. Label extraction
  4. Final dataset preparation
  5. Partitioning and analyzing the dataset
  6. Running the ML algorithms and recording measures

 

1. Column selection and feature data frame extraction

In the data, we have 44 questions and their responses from the customers. Each question can be taken as a feature, or its sub-questions can be taken as separate features. Based on the analysis done before, the features below were selected because they capture the customer mindset and can predict the store for a new user.

 

2. Feature Extraction

After extracting the respective data frames for the selected columns (most of which are haven-labelled), we extracted average values for some questions and the direct values for the sub-questions, as follows.

As the averages are of question responses, we replaced the NAs with the value 0.
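A sketch of this step on toy sub-question columns (the real column names differ):

```r
# Average a question's sub-question columns into one feature and
# replace all-NA averages with 0.
sub_q <- data.frame(q1_a = c(4, NA, 5), q1_b = c(5, NA, 3))
cust_loyality <- rowMeans(sub_q, na.rm = TRUE)  # one value per customer
cust_loyality[is.nan(cust_loyality)] <- 0       # all-NA rows -> 0
cust_loyality  # 4.5 0.0 4.0
```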

 

3. Label Extraction

The labels for the dataset are the store names that customers selected as their main store. As the responses are haven-labelled, we extracted the store names using as_factor(), as shown below.

store_cols <- c("F3_Haupteinkaufsstaette")
stores_df <- customerdata[store_cols]
stores_df = as_factor(stores_df)
head(stores_df)
## # A tibble: 6 x 1
##   F3_Haupteinkaufsstaette
##   <fct>                  
## 1 Netto (Rot)            
## 2 Rewe                   
## 3 Rewe                   
## 4 Aldi Nord              
## 5 Rewe                   
## 6 Rewe

 

4. Final Dataset Preparation

The extracted features and labels are combined into the final dataset below. A few labels have no data, so they were dropped from the final dataset; a glimpse of the dataset follows.

data_set <- cbind(cust_loyality, br_loc_sat, findability, price_setting, quality_setting, 
                  assortment, quality, frische, value_for_money, availabity,
                  presentation_goods, 
                  stores_df)
data_set <- droplevels(data_set)
head(data_set)
##   cust_loyality br_loc_sat findability price_setting quality_setting assortment
## 1      2.666667   2.333333         2.0          4.75             4.2          4
## 2      4.666667   3.333333         5.0          2.50             4.6          5
## 3      4.000000   5.000000         3.0          3.50             3.6          4
## 4      4.333333   2.666667         5.0          5.00             4.4          5
## 5      4.000000   4.666667         3.0          3.50             3.8          5
## 6      4.333333   3.333333         3.5          2.75             2.4          5
##   quality frische value_for_money availabity presentation_goods
## 1       4       4               4          3                  3
## 2       5       5               4          5                  5
## 3       5       4               5          4                  5
## 4       5       5               5          5                  5
## 5       4       4               3          4                  5
## 6       5       5               2          5                  5
##   F3_Haupteinkaufsstaette
## 1             Netto (Rot)
## 2                    Rewe
## 3                    Rewe
## 4               Aldi Nord
## 5                    Rewe
## 6                    Rewe

 

5. Partitioning and analyzing the dataset

We used the caret library for the machine learning algorithms and partitioned the data into 80% for training and 20% for validation. After partitioning, the training dataset contained 1854 samples. The distributions of the features and of the labels are as follows.

# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(data_set$F3_Haupteinkaufsstaette, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- data_set[-validation_index,]
# use the remaining 80% of data to training and testing the models
data_set <- data_set[validation_index,]


################### Analysing Dataset ##########################
dim(data_set)
## [1] 1854   12
# list types for each attribute
sapply(data_set, class)
##           cust_loyality              br_loc_sat             findability 
##               "numeric"               "numeric"               "numeric" 
##           price_setting         quality_setting              assortment 
##               "numeric"               "numeric"               "numeric" 
##                 quality                 frische         value_for_money 
##               "numeric"               "numeric"               "numeric" 
##              availabity      presentation_goods F3_Haupteinkaufsstaette 
##               "numeric"               "numeric"                "factor"
#Summarize the class distribution
percentage <- prop.table(table(data_set$F3_Haupteinkaufsstaette)) * 100
cbind(freq=table(data_set$F3_Haupteinkaufsstaette), percentage=percentage)
##                 freq  percentage
## Edeka            333 17.96116505
## Rewe             307 16.55879180
## Lidl             271 14.61704423
## Aldi Süd         145  7.82092772
## Kaufland         220 11.86623517
## Netto (Rot)      174  9.38511327
## Aldi Nord        107  5.77130529
## Amazon             4  0.21574973
## Real              60  3.23624595
## Penny             88  4.74649407
## DM                 7  0.37756203
## Rossmann           4  0.21574973
## Globus            21  1.13268608
## Norma             27  1.45631068
## Combi              5  0.26968716
## Jibi Markt         1  0.05393743
## Hit                7  0.37756203
## Netto (Schwarz)    8  0.43149946
## Tegut              9  0.48543689
## Klaas+Kock         1  0.05393743
## Denn's             6  0.32362460
## Alnatura           4  0.21574973
## Budnikowsky        1  0.05393743
## Feneberg           1  0.05393743
## Famila             9  0.48543689
## Markant            2  0.10787487
## nah&frisch         1  0.05393743
## Mix Markt          1  0.05393743
## Andere            30  1.61812298
# summarize attribute distributions
summary(data_set)
##  cust_loyality     br_loc_sat      findability    price_setting  
##  Min.   :1.000   Min.   :0.3333   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.333   1st Qu.:3.0000   1st Qu.:3.500   1st Qu.:3.000  
##  Median :4.000   Median :3.3333   Median :4.000   Median :3.750  
##  Mean   :3.877   Mean   :3.6223   Mean   :3.934   Mean   :3.622  
##  3rd Qu.:4.333   3rd Qu.:4.6667   3rd Qu.:4.500   3rd Qu.:4.250  
##  Max.   :5.000   Max.   :5.0000   Max.   :5.000   Max.   :5.000  
##                                                                  
##  quality_setting   assortment       quality         frische     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.400   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000  
##  Median :3.600   Median :4.000   Median :4.000   Median :4.000  
##  Mean   :3.692   Mean   :4.132   Mean   :4.247   Mean   :4.179  
##  3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##  value_for_money   availabity    presentation_goods F3_Haupteinkaufsstaette
##  Min.   :1.000   Min.   :1.000   Min.   :1.000      Edeka      :333        
##  1st Qu.:4.000   1st Qu.:3.000   1st Qu.:4.000      Rewe       :307        
##  Median :4.000   Median :4.000   Median :4.000      Lidl       :271        
##  Mean   :4.079   Mean   :3.978   Mean   :4.017      Kaufland   :220        
##  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000      Netto (Rot):174        
##  Max.   :5.000   Max.   :5.000   Max.   :5.000      Aldi Süd   :145        
##                                                     (Other)    :404

 

Visualizing Dataset

x <- data_set[,1:11]
y <- data_set[,12]
head(y)
## [1] Netto (Rot) Rewe        Rewe        Aldi Süd    Edeka       Norma      
## 29 Levels: Edeka Rewe Lidl Aldi Süd Kaufland Netto (Rot) Aldi Nord ... Andere
levels(y)
##  [1] "Edeka"           "Rewe"            "Lidl"            "Aldi Süd"       
##  [5] "Kaufland"        "Netto (Rot)"     "Aldi Nord"       "Amazon"         
##  [9] "Real"            "Penny"           "DM"              "Rossmann"       
## [13] "Globus"          "Norma"           "Combi"           "Jibi Markt"     
## [17] "Hit"             "Netto (Schwarz)" "Tegut"           "Klaas+Kock"     
## [21] "Denn's"          "Alnatura"        "Budnikowsky"     "Feneberg"       
## [25] "Famila"          "Markant"         "nah&frisch"      "Mix Markt"      
## [29] "Andere"
# boxplot for each attribute on one image
par(mfrow=c(1,5))
for(i in 1:5) {
  boxplot(x[,i], main=names(data_set)[i])
}

# barplot for class breakdown

plot(y)
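Step 6 of the pipeline (running the ML algorithms and recording measures) is not shown above; a minimal caret sketch on toy data, where the method and control settings are standard caret options rather than the exact ones we used:

```r
library(caret)
set.seed(99)
# toy stand-in for the prepared data_set (features + store label)
toy <- data.frame(f1 = rnorm(120), f2 = rnorm(120),
                  label = factor(sample(c("Edeka", "Rewe", "Lidl"), 120, TRUE)))

# 5-fold cross-validation; accuracy and kappa are recorded per tuning value
ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(label ~ ., data = toy, method = "knn", trControl = ctrl)
fit$results
```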

 

 

 

 

 

 

 

 

 

Work by Data Science Cubs